Abstract

We investigate distances on binary (presence/absence) data in the context of a
Dollo process, where a trait can only arise once on a phylogenetic tree but may
be lost many times. We introduce a novel distance, the Additive Dollo Distance
(ADD), which is consistent for data generated under a Dollo model, and show that
it has some useful theoretical properties including an intriguing link to the LogDet
distance. Simulations of Dollo data are used to compare a number of binary distances
including ADD, LogDet, Nei-Li and some simple, but to our knowledge previously
unstudied, variations on common binary distances. The simulations suggest
that ADD outperforms other distances on Dollo data. Interestingly, we found that
the LogDet distance performs poorly in the context of a Dollo process, which may
have implications for its use in connection with conditioned genome reconstruction.
We apply the ADD to two Diversity Arrays Technology (DArT) datasets, one that
broadly covers Eucalyptus species and one that focuses on the Eucalyptus series
Adnataria. We also reanalyse gene family presence/absence data on bacteria from the
COG database and compare the results to previous phylogenies estimated using the
conditioned genome reconstruction approach.

KEY WORDS: Additive Dollo Distance, Dollo process, LogDet/paralinear distances,
Diversity Arrays Technology, Eucalyptus phylogeny, Adnataria phylogeny,
gene content phylogeny, conditioning genomes.

A fundamental idea in evolutionary biology is that when two species share a complex 
trait the most likely explanation of the similarity is that both species have inherited the 
trait from a common ancestor. However, the absence of a particular trait carries far less 
information; for instance, wings and eyes are complex traits that have been lost many
times independently in different parts of the evolutionary tree of life. As long ago as 1893 
Dollo captured this idea in what is now known as Dollo's Law which states that complex 
traits may be gained once somewhere in evolutionary history, and may be subsequently 
lost independently many times. 

In this paper we will only consider Dollo models that generate binary data, recording
the presence or absence of some trait. For example: does an organism have any genes
in a particular gene family? Does it have some skeletal feature, such as the mammalian
inner ear? When its DNA is digested by a mix of restriction enzymes, is a particular
DNA fragment produced? In reality, determining if a trait is present or absent can be
less clear cut; for example, paralogues can confound gene presence/absence decisions. In
a stochastic Dollo model the gain and loss of such traits is treated as arising from a
simple probability model. While few situations match Dollo's Law exactly, it provides a
useful model in many evolutionary scenarios of interest. Dollo models have been used to
understand gene families (Huson and Steel 2004; Pagan and Martin 2007) and complex
morphological traits (Gould 1970), and stochastic Dollo models have been used to study
cognates in language evolution (Ryder and Nicholls 2011; Nicholls and Gray 2008).



In some of these scenarios such data can have another interesting property: only traits 
that are present in particular reference taxa are visible; or, in other words, the data are 
censored. This happens, for example, with array-based studies where a small set of taxa 
is used to create a set of traits (i.e. a set of DNA fragments that make up an array) to 
which other taxa can be compared. The idea of a reference taxon also has parallels in the
gene-content setting where some authors have proposed "conditioned genome reconstruction"
(Lake and Rivera 2004), in which one genome is selected as a reference and, for the
remaining genomes, only gene families present in the reference genome are analyzed.



Data thought to follow Dollo's Law traditionally have been analysed using a parsimony
approach (Le Quesne 1974; Farris 1977). As is normally the case with parsimony
approaches, branch length information is not taken into account. The use of stochastic
Dollo models is relatively recent (Ryder and Nicholls 2011; Nicholls and Gray 2008) and,
so far, they have only been implemented in a Bayesian framework. Bayesian methods are
computationally intensive so there is a need for an approach that is both computationally
efficient and statistically consistent.

This motivated us to develop a distance-based approach to Dollo data. Our initial
motivation was to derive a distance suitable for phylogenetic analysis of Diversity Array
Technology (DArT) data (Jaccoud et al. 2001) which, by its nature, is censored. On
further consideration we realized that the same formula can be derived directly from the
mathematics of the stochastic Dollo process, or as a limiting case of the LogDet distance.
In the following sections we begin by deriving the ADD in a general Dollo context 
and then show why it also applies to the censored Dollo model that arises for DArT data. 
We then describe an intriguing link to the popular LogDet distances. After introducing a 
few other binary distances we present a simulation study that compares the performance 
of the new ADD to other binary distances when applied to Dollo data under a range of 
censoring schemes. As an illustration of our approach we apply the new distance to three 
case studies, two involving DArT data for Eucalyptus species and one using gene content 
information. We conclude with a discussion in which we point out some potential future 
directions. 

Methods 

Deriving an Additive Distance for the Stochastic Dollo Process 



The description of a stochastic Dollo process which we follow is that of Huson and Steel
(2004), whose discussion is in the context of the gene content of a genome. This model can
be described as a constant-birth, proportional-death Markov process. New markers are
acquired (e.g. genes added to the genome) at rate $\lambda$, and existing markers are
independently deleted with intensity rate $\mu$. $G(t)$ is the set of markers present at
time $t$. We make an initial observation of the set of markers $G(s)$ at start time $s$.
It is assumed that the system is at equilibrium at time $s$, i.e. it has been evolving by
the stochastic Dollo process for long enough to be independent of initial conditions. The
genome then evolves for a further time $t$ and we observe the marker set $G(s + t)$. We
then put

\[ n_{11} = |G(s) \cap G(s+t)|, \quad n_{10} = |G(s) \setminus G(s+t)|, \quad n_{01} = |G(s+t) \setminus G(s)| \tag{1} \]

i.e. $n_{11}$ is the number of shared presences (markers present in the genome at both time
points), and $n_{10}$ and $n_{01}$ are the counts of markers present at one time point but
not the other.



Under these circumstances, Huson and Steel (2004) prove the following facts:

1. $l(s) = |G(s)| = n_{11} + n_{10}$ is Poisson distributed with mean $m = \lambda/\mu$.

2. If $l(0)$ is chosen according to this equilibrium Poisson distribution, then the process
$G(t)$ is a time-reversible Markov process.

3. $n_{01}$ is Poisson distributed with mean $m(1 - e^{-\mu t})$.

4. $n_{10}$ is binomially distributed with $l(s)$ trials each with probability
$1 - e^{-\mu t}$ of success, where a "success" in this context is the loss of a marker.

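From 4 we have expected value $E(n_{10}/(n_{11} + n_{10})) = 1 - e^{-\mu t}$ and we solve
for $t$:

\[ t = -\frac{1}{\mu} \log E\left(1 - \frac{n_{10}}{n_{11} + n_{10}}\right) \tag{2} \]

and so we get a distance $d$ that is proportional to time given by

\[ d = -\log\left(1 - \frac{n_{10}}{n_{11} + n_{10}}\right) = \log\left(\frac{n_{11} + n_{10}}{n_{11}}\right). \tag{3} \]

Moreover, $d$ is a statistically consistent estimator for $\mu t$ (i.e. $d$ will converge on
$\mu t$ as we collect more data). Note that by the time reversibility property, we could
equally well use $d = \log((n_{11} + n_{01})/n_{11})$. To make use of all available data
(both $n_{10}$ and $n_{01}$) we average these two distances (omitting the constant factor
of 1/2, which leaves the distance proportional to time) to give

\[ d_{ADD} = \log\left(\frac{(n_{11} + n_{10})(n_{11} + n_{01})}{n_{11}^2}\right) \tag{4} \]

and call this the Additive Dollo Distance (ADD).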
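As an illustration, the ADD can be computed directly from two presence/absence vectors. The sketch below is not the authors' Perl program; the function names `pair_counts` and `add_distance` are hypothetical, and the formula is assumed to be $\log((n_{11}+n_{10})(n_{11}+n_{01})/n_{11}^2)$, the form used in the censored-data derivation later in the Methods. Returning infinity when there are no shared presences is also an assumption.

```python
import math


def pair_counts(a, b):
    """Return (n11, n10, n01, n00) for two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = len(a) - n11 - n10 - n01
    return n11, n10, n01, n00


def add_distance(a, b):
    """Additive Dollo Distance (natural log); n00 is deliberately ignored."""
    n11, n10, n01, _ = pair_counts(a, b)
    if n11 == 0:
        # Assumption: no shared markers is treated as infinite separation.
        return math.inf
    return math.log((n11 + n10) * (n11 + n01) / n11 ** 2)


if __name__ == "__main__":
    taxon_a = [1, 1, 1, 0, 0]
    taxon_b = [1, 0, 1, 1, 0]
    print(pair_counts(taxon_a, taxon_b), add_distance(taxon_a, taxon_b))
```

Because joint absences do not enter the formula, appending columns that are absent in both taxa leaves the distance unchanged.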



Dollo models with censored data 

As mentioned in the introduction, some datasets of interest have an additional property 
whereby only markers which are present in the reference taxon or taxa can be detected. 
This is referred to either as "censored data" or as an "ascertainment bias". We want to
extend the ADD distance to cases of both single and multiple reference taxa. As we will
see below, the first case does not require any change in the formula. 

We introduce our approach for censored data that arise in the context of DArT
(Jaccoud et al. 2001). DArT uses particular restriction enzymes to create a "genomic
representation" (DNA fragments typically 300-1000 bp in length) from one or more
reference taxa. The restriction enzymes used consist of a 'rare cutter' (e.g. Pst I, that
cuts at sequence CTGCAG) and a 'frequent cutter' (e.g. BstN I, that cuts at sequence TCGA).
Only fragments cut at both ends by the rare cutter become markers. The fragments are 
cloned and are then arrayed onto a glass microscope slide. Genomic representations are 
prepared for study samples using the same restriction enzymes. The study samples are 
screened (via DNA-DNA hybridisation) with the array. The presence or absence in the 
sample of the DArT markers on the slide is recorded to produce a binary dataset. 

A DArT marker can be lost during evolution by mutations which disrupt the rare 
cutter target at either end of the marker, or introduce a new rare or common cutter 
target within the marker. Once lost, a marker can only be regained by reversing the 
mutation which caused loss, before another loss-causing mutation occurs. This is a rare 
event, so we can model DArT marker gain/loss as a Dollo process. 

The data from a DArT analysis suffer from ascertainment bias: only markers which 
are present in the reference taxon or taxa (from which the array was prepared) can be 
detected. This can distort distances, depending on the proximity of the taxa to the
reference. For example, the Hamming distance

\[ d_H = \frac{n_{10} + n_{01}}{n_{00} + n_{01} + n_{10} + n_{11}} \tag{5} \]

will underestimate distances between taxa distant from the reference, as both taxa will
have few markers in common with the reference, hence $n_{00}$ will be large. This illustrates
a general theme: joint absences ($n_{00}$) do not carry the same meaning as joint presences
($n_{11}$) and should not be accorded equal weight in the distance formula.

The most appropriate way to infer phylogeny from DArT data is not obvious, and previous
authors have taken a variety of approaches, often analyzing the same data multiple
ways. The distance of Nei and Li (1979) is the most commonly used distance measure
for DArT data (James et al. 2008; Wenzl et al. 2004; Xia et al. 2005). It is designed for
restriction site data, which DArT does produce, but for fragments generated by a single
restriction enzyme, rather than the two enzymes used by DArT. Furthermore, the Nei-Li
distance does not account for the ascertainment bias caused by observing only those
markers that are present in the reference taxa. Other distances that have been used are
LogDet (James et al. 2008) and DICE (Mace et al. 2008). Non distance-based methods
include maximum parsimony (James et al. 2008; Steane et al. 2011) and Bayesian analysis
(James et al. 2008; Steane et al. 2011). Sometimes a principal coordinate analysis is
done in place of, or in addition to, a phylogenetic analysis (Jing et al. 2009;
James et al. 2008; Yang et al. 2006).



Consider DArT data for taxa A and B generated from an array constructed with
markers taken from taxon R. The presence/absence data for A and B are summarized
by the counts $n_{00}$, $n_{01}$, $n_{10}$ and $n_{11}$ (where $n_{01}$ is the number of
markers present at B but absent at A, etc.). Consider the unrooted phylogenetic tree for
A, B, and the reference taxon R. We denote the unique internal node of this unrooted tree
X. As all markers in the dataset are present at R, and the markers evolved by a Dollo
process, any marker present at A or B must also be present at X. In particular, a marker
present at A is present at X, and the fraction of these markers which are lost between X
and B is $n_{10}/(n_{10} + n_{11})$. In view of the arguments above,

\[ E\left(\frac{n_{10}}{n_{10} + n_{11}}\right) = 1 - e^{-\mu t_{XB}} \tag{6} \]

where $t_{XB}$ is the time (branch length) between X and B. From this, we find a distance
proportional to time

\[ d(X, B) = \log\left(\frac{n_{11} + n_{10}}{n_{11}}\right) \propto t_{XB}. \tag{7} \]

By symmetry, we also have $d(X, A) = \log((n_{01} + n_{11})/n_{11})$ and so

\[ d(A, B) = d(X, A) + d(X, B) = \log\left(\frac{(n_{11} + n_{10})(n_{11} + n_{01})}{n_{11}^2}\right) = d_{ADD} \tag{8} \]

and, once again, we arrive at the formula for $d_{ADD}$. (Note that this formula will also
work if one of A or B is the reference taxon R.)

In this derivation, we assume the existence of a point X in the tree such that any 
marker present at A or B is present at X. Under a Dollo process, this assumption is 
valid if there is a single reference taxon, or if there are multiple reference taxa which are 
monophyletic with respect to A and B. This will not generally be the case, and indeed it
is not the case for the Eucalyptus dataset (Steane et al. 2011) we will look at in one of
the case studies.

The potential problem with the ADD for data generated from multiple reference taxa
is illustrated by Figure 1. If we consider only the characters which are present at R (the
leftmost 8 columns at each node), we get the correct answer: $n_{11} = 2$, $n_{01} = 2$,
$n_{10} = 6$, $d_{ADD} = \log_2(4 \times 8/2^2) = 3$ (taking logarithms as base 2 for
convenience). (There is a 1/8 chance of a marker at A surviving to B, so the $\log_2$ path
length between A and B is 3.) However, using the full set of data (which includes S as an
additional reference taxon), we now have $n_{11} = 3$, $n_{01} = 7$, $n_{10} = 7$ and
$d_{ADD} = \log_2(10 \times 10/3^2) = 3.474$. Not all taxon pairs will suffer this bias; for
example, in Figure 1, if we added a taxon C attaching at point r, the distance
$d_{ADD}(A, C)$ would be unbiased.
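The arithmetic of this example can be checked with a few lines of Python. This is only a verification sketch: the helper name `add_from_counts` is hypothetical, and the counts are those quoted in the paragraph above.

```python
import math


def add_from_counts(n11, n10, n01):
    """ADD in base-2 logarithms, computed from the pairwise counts."""
    return math.log2((n11 + n10) * (n11 + n01) / n11 ** 2)


# Markers ascertained from the single reference R only.
single_reference = add_from_counts(n11=2, n10=6, n01=2)

# Markers ascertained from both R and S.
two_references = add_from_counts(n11=3, n10=7, n01=7)

print(round(single_reference, 3), round(two_references, 3))
```

The first value matches the true path length of 3, and the second is inflated by the additional reference taxon.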

If we know which markers are derived from which reference taxa, then we can partition 
the data by marker reference taxon, calculate an ADD distance matrix for each partition, 
and then form an average distance matrix from the partition distance matrices. Under the 
plausible assumption that the sampling variances in the partition distances are inversely 
proportional to the number of markers in that partition, the minimum variance estimator 
is obtained by a weighted average, with weights proportional to the square root of the 
number of markers in the partition. We call this weighted average of partitioned distances 
the Partitioned Additive Dollo Distance (PADD). 
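The averaging step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `padd_matrix` is hypothetical, the per-partition ADD matrices are assumed to have been computed already (as square lists of lists with the taxa in the same order), and the weights are supplied by the caller so that they can follow the weighting rule described above.

```python
def padd_matrix(partition_matrices, weights):
    """Weighted average of per-partition distance matrices.

    partition_matrices: list of square matrices (lists of lists), one per
        reference-taxon partition, all with the same taxon order.
    weights: one non-negative weight per partition (caller's choice).
    """
    if not partition_matrices or len(partition_matrices) != len(weights):
        raise ValueError("need one weight per partition matrix")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    size = len(partition_matrices[0])
    averaged = [[0.0] * size for _ in range(size)]
    for matrix, weight in zip(partition_matrices, weights):
        for i in range(size):
            for j in range(size):
                averaged[i][j] += weight * matrix[i][j] / total
    return averaged


if __name__ == "__main__":
    from_reference_r = [[0.0, 1.0], [1.0, 0.0]]
    from_reference_s = [[0.0, 3.0], [3.0, 0.0]]
    print(padd_matrix([from_reference_r, from_reference_s], weights=[1.0, 3.0]))
```

A practical implementation would also need a policy for partitions in which a taxon pair has no shared markers, which this sketch does not address.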

The derivation of the PADD assumes that the markers chosen from each reference are 
an independent and unbiased sample from the markers present in that reference taxon. In 
practice, this assumption may fail for two reasons. When constructing the DArT array, we 
attempt to eliminate redundancy - so a marker selected for the array from one reference
taxon precludes the same marker being selected from a second reference taxon in which
it may also be present. (Some redundancy is kept deliberately as an internal control, but 
omitted from the final alignments.) Also, markers are chosen which show useful levels of 
polymorphism. A marker present in all taxa is not useful, nor is one which is present only 
in the reference taxon from which it was derived. 



Links to Conditioned Genome Reconstruction: The Additive Dollo 
Distance as a Limiting Case of LogDet 

Another context in which Dollo models may be appropriate is gene-family presence/absence
data. Such data are increasingly available as more and more genomes are studied. The
COG database (Tatusov et al. 2003) sorts genes from 50 bacteria, 13 archaea and 3
eukaryotes into nearly 5000 gene families. The gene family presence/absence data have been
used for phylogenetic inference by several authors (Lake and Rivera 2004; Spencer et al.
2007; Cotton and McInerney 2008; Sangaralingam et al. 2010), but Dollo models have
not, to our knowledge, been applied. A problem noted by previous analysts of these
data is that the taxa vary greatly in the number of gene families they contain. Thus,
an unsophisticated analysis would be biased towards grouping together taxa that have
small genomes. This is somewhat similar to the problem of inferring a phylogeny in the
presence of base-frequency biases, which is an issue the LogDet distance (Lake 1994;
Lockhart et al. 1994) was designed to overcome. When Lake and Rivera (2004) sought to
apply the LogDet distance to data from the COG database, they realised they needed
to know the number of shared absences ($n_{00}$). This in turn required the definition of a
"universal" set of gene families. They achieved this by selecting a 'conditioning genome',
and taking the set of gene families present in that genome as being the "universal" set.

In using the LogDet distance, Lake and Rivera (2004) explicitly assumed that gene
family presence/absence is a Markov process where a gene family can disappear from a
lineage and later reappear. Frequent horizontal gene transfer (HGT) was used to justify
this assumption. That LogDet treats shared absences and presences ($n_{00}$ and $n_{11}$)
symmetrically, when physically they have very different meaning, seems to us to be a
potential weakness in their method that has not been commented on previously.

Lake and Rivera's assumption (that any gene family can be gained by any taxon at
any time) can be thought of as one extreme. The opposite extreme is to discount the
possibility of HGT completely, and adopt a Dollo model. Consider the standard
(non-Dollo) multiple site two state Markov model: we have $N$ sites evolving independently
between two states ('present' and 'absent') by a continuous time Markov process. The
rate for absent→present transitions is $\alpha$ and for present→absent transitions is
$\mu$. As before, we sample this process at two different times and get counts $n_{11}$,
$n_{10}$ and $n_{01}$ of shared presences, and of presences at one time point but not the
other. In addition, we get $n_{00}$, the number of shared absences. In this two state case,
the LogDet distance formula is

\[ d_{LogDet} = -\frac{1}{2} \log\left(\frac{n_{00} n_{11} - n_{01} n_{10}}{\sqrt{(n_{00} + n_{10})(n_{01} + n_{11})(n_{00} + n_{01})(n_{10} + n_{11})}}\right). \tag{9} \]

If we now allow $N$ to vary, we can construct a Markov process so that in the limit
$N \to \infty$ it becomes a Dollo process. We require the rate of creation of new markers
(sites in the 'present' state) to be $\lambda$, so we set $\lambda = N\alpha$. (The loss rate
$\mu$ is independent of $N$.) As $N \to \infty$ (and $\alpha \to 0$), the distributions of
$n_{11}$, $n_{10}$ and $n_{01}$ will remain finite and converge on those for the stochastic
Dollo process described above. As $n_{11}$, $n_{10}$ and $n_{01}$ are finite, we must have
$n_{00} \to \infty$. Letting $n_{00} \to \infty$ in equation 9,

\[ \lim_{n_{00} \to \infty} d_{LogDet} = \lim_{n_{00} \to \infty} -\frac{1}{2} \log\left(\frac{n_{00} n_{11}}{\sqrt{n_{00}(n_{01} + n_{11})\, n_{00}(n_{10} + n_{11})}}\right) = \frac{1}{4} \log\left(\frac{(n_{11} + n_{10})(n_{11} + n_{01})}{n_{11}^2}\right) = \frac{1}{4} d_{ADD} \tag{10} \]

showing that in this limit, $d_{LogDet}$ is proportional to $d_{ADD}$.
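The limit can also be checked numerically. The sketch below assumes the two-state LogDet formula $-\frac{1}{2}\log\left((n_{00}n_{11} - n_{01}n_{10})/\sqrt{(n_{00}+n_{10})(n_{01}+n_{11})(n_{00}+n_{01})(n_{10}+n_{11})}\right)$ and the ADD formula $\log((n_{11}+n_{10})(n_{11}+n_{01})/n_{11}^2)$; the function names and the example counts are arbitrary choices made for illustration.

```python
import math


def logdet_distance(n00, n01, n10, n11):
    """Two-state LogDet distance from the 2x2 table of joint counts."""
    det = n00 * n11 - n01 * n10
    norm = math.sqrt((n00 + n10) * (n01 + n11) * (n00 + n01) * (n10 + n11))
    return -0.5 * math.log(det / norm)


def add_distance(n01, n10, n11):
    """Additive Dollo Distance from the pairwise counts (n00 is not used)."""
    return math.log((n11 + n10) * (n11 + n01) / n11 ** 2)


if __name__ == "__main__":
    n11, n10, n01 = 30, 12, 20  # arbitrary illustrative counts
    limit = add_distance(n01=n01, n10=n10, n11=n11) / 4
    for n00 in (1e2, 1e4, 1e6, 1e8):
        value = logdet_distance(n00=n00, n01=n01, n10=n10, n11=n11)
        print(f"n00={n00:.0e}  LogDet={value:.6f}  ADD/4={limit:.6f}")
```

With the presence counts held fixed, the LogDet value approaches one quarter of the ADD as the number of joint absences grows.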






Comparison of binary distances 

Various distances have been defined in the literature for presence/absence data. We have
picked a number of these to compare to ADD including:

fractional Hamming distance
\[ d_H = \frac{n_{10} + n_{01}}{n_{00} + n_{01} + n_{10} + n_{11}} \]

Jaccard distance
\[ d_J = \frac{n_{10} + n_{01}}{n_{01} + n_{10} + n_{11}} \]

DICE distance
\[ d_{DICE} = \frac{n_{10} + n_{01}}{n_{01} + n_{10} + 2n_{11}} \]

LogDet distance
\[ d_{LogDet} = -\frac{1}{2} \log\left(\frac{n_{00} n_{11} - n_{01} n_{10}}{\sqrt{(n_{00} + n_{10})(n_{01} + n_{11})(n_{00} + n_{01})(n_{10} + n_{11})}}\right) \]

Nei-Li distance
\[ d_{NL} = -\log(P) \quad \text{where} \quad F = \frac{P^4}{3 - 2P} \quad \text{and} \quad F = \frac{2n_{11}}{2n_{11} + n_{10} + n_{01}} \]

(Choi et al. 2010; Lockhart et al. 1994; Lake 1994; Nei and Li 1979).



Huson and Steel (2004) derived a maximum likelihood distance for gene presence/absence
data under a Dollo process

\[ d = -\log\left(\frac{\beta + \sqrt{\beta^2 + 4\alpha_{12}}}{2}\right) \tag{11} \]

where in our notation $\beta = 1 - (n_{11} + n_{10} + n_{01})/m$, $\alpha_{12} = n_{11}/m$
and $m = \lambda/\mu$ is the expected number of genes per genome. We do not know $m$, but
if we estimate it by the mean number of genes/markers at the two taxa,
$m = [(n_{11} + n_{10}) + (n_{11} + n_{01})]/2$, then equation 11 simplifies to

\[ d = \log\left(\frac{2n_{11} + n_{01} + n_{10}}{2n_{11}}\right). \tag{12} \]



This is a simple transformation of the DICE distance, being $-\log(1 - d_{DICE})$, so we
name it the Log DICE distance. We can intuitively justify this transformation, arguing that
logarithms correct for multiple events (e.g. gain, loss, mutation) on the same marker.
We can also perform a similar transformation on the Jaccard distance to create the Log
Jaccard distance $d_{LJ}$. So in summary, in addition to the standard distances above, we
introduce previously unstudied distances:

\[ d_{LJ} = -\log(1 - d_J) = \log\left(\frac{n_{11} + n_{01} + n_{10}}{n_{11}}\right) \]

\[ d_{LDICE} = -\log(1 - d_{DICE}) = \log\left(\frac{2n_{11} + n_{01} + n_{10}}{2n_{11}}\right) \]

\[ d_{ADD} = \log\left(\frac{(n_{11} + n_{10})(n_{11} + n_{01})}{n_{11}^2}\right) \]

as well as the composite ADD distance method $d_{PADD}$, defined above.
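The count-based distances in this section can be written compactly in code. The sketch below is illustrative only: the function names are hypothetical, natural logarithms are assumed throughout, and the Huson and Steel maximum likelihood distance is written as $-\log((\beta + \sqrt{\beta^2 + 4\alpha_{12}})/2)$, the form for which the stated simplification to Log DICE holds when $m$ is estimated by the mean marker count of the two taxa.

```python
import math


def hamming(n11, n10, n01, n00):
    return (n10 + n01) / (n00 + n01 + n10 + n11)


def jaccard(n11, n10, n01):
    return (n10 + n01) / (n01 + n10 + n11)


def dice(n11, n10, n01):
    return (n10 + n01) / (n01 + n10 + 2 * n11)


def log_jaccard(n11, n10, n01):
    return -math.log(1 - jaccard(n11, n10, n01))


def log_dice(n11, n10, n01):
    return -math.log(1 - dice(n11, n10, n01))


def add_distance(n11, n10, n01):
    return math.log((n11 + n10) * (n11 + n01) / n11 ** 2)


def huson_steel_ml(n11, n10, n01, m):
    """ML distance under the Dollo model for a given expected genome size m."""
    beta = 1 - (n11 + n10 + n01) / m
    alpha12 = n11 / m
    return -math.log((beta + math.sqrt(beta ** 2 + 4 * alpha12)) / 2)


if __name__ == "__main__":
    n11, n10, n01, n00 = 30, 12, 20, 138  # arbitrary illustrative counts
    m_hat = ((n11 + n10) + (n11 + n01)) / 2  # plug-in estimate of m
    print("Hamming    ", hamming(n11, n10, n01, n00))
    print("Jaccard    ", jaccard(n11, n10, n01))
    print("DICE       ", dice(n11, n10, n01))
    print("Log Jaccard", log_jaccard(n11, n10, n01))
    print("Log DICE   ", log_dice(n11, n10, n01))
    print("ADD        ", add_distance(n11, n10, n01))
    print("ML (m_hat) ", huson_steel_ml(n11, n10, n01, m_hat))
```

In this example the last two printed lines of the DICE family agree: the maximum likelihood distance evaluated at the plug-in estimate of $m$ coincides with Log DICE.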

The Triangle Inequality and Additivity 

Two important properties of distance functions (i.e. bivariate, non-negative functions
d(x,y) with d(x,y) = 0 if and only if x = y, and d(x,y) = d(y,x) for all x, y) are the
triangle inequality and additivity. The triangle inequality states that d(x,y) ≤ d(x,z) +
d(z,y) must hold for all x, y, z (in which case d is known as a 'metric'). Additivity
states that if z was the last common ancestor of x and y, and the sequences evolved
independently, then d(x,y) = d(x,z) + d(y,z) should hold (on average). The desire
for additivity accounts for the presence of the logarithm function in many phylogenetic
distances.

Not all of the distances defined above satisfy the triangle inequality; see Table 1 for
counterexamples to the triangle inequality holding for some of the distances. In Table 2
we summarize which distances are additive and which obey the triangle inequality. The
additive Dollo distance $d_{ADD}$ is additive by construction in the stochastic Dollo
context, and it is a limiting case of the LogDet distance which is additive (Lake 1994).
Notably, the only two distances ($d_H$, $d_J$) known to obey the triangle inequality are
not additive, and the only two distances known to be additive ($d_{LogDet}$, $d_{ADD}$)
violate the triangle inequality. Phylogeneticists place greater value on additivity than on
obeying the triangle inequality, as demonstrated by the popularity of LogDet.
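To make the violation concrete, here is a small constructed example (ours, and not one of the counterexamples in Table 1) in which the ADD between two marker sets exceeds the sum of their distances to a third. The function name `add_distance` and the marker sets are assumptions made for illustration, with the ADD taken as $\log((n_{11}+n_{10})(n_{11}+n_{01})/n_{11}^2)$.

```python
import math


def add_distance(x, y):
    """ADD between two taxa represented as sets of present markers."""
    n11 = len(x & y)
    n10 = len(x - y)
    n01 = len(y - x)
    return math.log((n11 + n10) * (n11 + n01) / n11 ** 2)


x = {1, 2, 3}
y = {3, 4, 5}
z = {1, 2, 3, 4, 5}

direct = add_distance(x, y)                      # log(3 * 3 / 1) = log 9
via_z = add_distance(x, z) + add_distance(z, y)  # 2 * log(3 * 5 / 9)

print(f"d(x,y) = {direct:.3f}, d(x,z) + d(z,y) = {via_z:.3f}")
print("triangle inequality violated:", direct > via_z)
```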

It is worth noting that for $d_{LJ}(x, y)$ and $d_{LDICE}(x, y)$ to be additive, it is
necessary that they go to infinity as the evolutionary distance between x and y goes to
infinity. For the stochastic Dollo process, $n_{11} = 0$ for infinitely separated x and y,
so $d_{LJ}$ and $d_{LDICE}$ go to infinity as required, but for a Markov process where
$E(n_{11}) > 0$ for unrelated (x, y), $d_{LJ}$ and $d_{LDICE}$ will tend to a finite limit.
We can generalize the formulae to correct for this, using

\[ d_{LJ}(x, y) = -\log(b - d_J(x, y)) \]

where $b$ is the expected value of $d_J$ evaluated on uncorrelated/infinitely separated
sequences (and a similar formula applies for $d_{LDICE}$). For example, for a Markov process
where states 0 and 1 are equally likely at equilibrium, we have $b = 2/3$ for $d_{LJ}$ and
$b = 1/2$ for $d_{LDICE}$.
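The two values of $b$ quoted above can be recovered with a short calculation. As a sketch, assume $N$ independent sites at which the two sequences are uncorrelated and each state has probability 1/2, and replace the counts by their expectations (a large-$N$ approximation):

```latex
\begin{align*}
E(n_{00}) = E(n_{01}) = E(n_{10}) = E(n_{11}) &= \frac{N}{4}, \\
d_J \approx \frac{N/4 + N/4}{N/4 + N/4 + N/4} &= \frac{2}{3}, \\
d_{DICE} \approx \frac{N/4 + N/4}{N/4 + N/4 + 2(N/4)} &= \frac{1}{2}.
\end{align*}
```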

Simulating censored Dollo data 

The general scheme for our simulations is to create a random tree, simulate a Stochastic 
Dollo process along it, select reference taxa and, finally, select which markers are used to 
produce an alignment (on the basis of which markers are present at the reference taxa). 
We generate clock-like and non-clock-like trees. For the non-clock-like trees, we
generate the tree topology by a Yule process (Yule 1924), then branch lengths are set so
that they are distributed uniformly between lengths 0.05 and 0.40. For clock-like trees,
we generate a tree by a Yule process with mean branch length 0.1, and repeat this process
until we obtain a tree whose shortest branch is no shorter than 0.01. (As short branch
lengths are hard to resolve no matter how good the phylogenetic method, keeping such
branches reduces the contrast between "good" and "poor" methods, which would make
our simulation results harder to interpret.) Our simulated data are based on both 9- and
15-taxon trees.

For a given simulation run, we specify the expected number of markers per genome,
$m$. As a Dollo process in equilibrium is time reversible (Huson and Steel 2004), we start
the process at an arbitrary taxon, with the number of markers at that taxon drawn from
a Poisson distribution with mean $m$. Then we propagate the set of markers through the
tree. On each branch with length $b$, existing markers are lost with probability
$1 - e^{-b}$ each and the number of new markers created has a Poisson distribution with
mean $m(1 - e^{-b})$.
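The propagation step along a single branch can be sketched as below. This is a minimal illustration, not the simulation code used for the results: the names `poisson` and `evolve_branch` are hypothetical, markers are represented as integer identifiers, and each new marker is given a fresh identifier so that a lost marker can never be regained. The Poisson sampler counts unit-rate arrivals, which is simple but takes time proportional to the mean.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate by counting unit-rate arrivals before time `mean`."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def evolve_branch(markers, b, m, rng, next_id):
    """Propagate a marker set along a branch of length b.

    Each existing marker survives with probability exp(-b); the number of
    new markers is Poisson with mean m * (1 - exp(-b)).
    Returns (child_markers, next_unused_id).
    """
    survive = math.exp(-b)
    child = {marker for marker in markers if rng.random() < survive}
    gained = poisson(rng, m * (1.0 - survive))
    child.update(range(next_id, next_id + gained))
    return child, next_id + gained


if __name__ == "__main__":
    rng = random.Random(1)
    m = 200
    start = set(range(poisson(rng, m)))  # equilibrium marker count
    child, next_id = evolve_branch(start, 0.3, m, rng, next_id=len(start))
    print(len(start), len(child), len(start & child))
```

Applying `evolve_branch` recursively from the starting taxon outwards would give marker sets at every node of the tree.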
We have a number of different models for selecting the markers that will be included
in the alignment to be analyzed, and (for the PADD method) how the markers are
partitioned.

incl1 (One reference taxon, included.) One taxon is chosen as a reference. Only markers
present at that taxon are selected. There is only one partition of the markers. 

excl1 (One reference taxon, excluded.) As 'incl1', except we discard the reference taxon
from the alignment. 

incl2 (Two reference taxa, included.) Two taxa are chosen as references. All markers 
present at either reference taxon are selected. For partitioning, a marker which is 
present at both reference taxa is assigned randomly to the partition of one of them. 
Markers which are present at only one reference taxon go into that taxon's partition. 

excl2 (Two reference taxa, excluded.) As 'incl2' except we discard both of the reference 
taxa from the alignment. 

all (All taxa are references.) All markers are included in the alignment. Each marker is 
assigned randomly to the partition of one of the taxa at which it is present. (We 
have as many partitions as taxa.) 

p2inc (Two reference taxa, included, predetermined partitioning.) Two reference taxa are 
chosen. Each marker is assigned randomly to the partition of one of the references. 
Only markers which are present at their partition's reference taxon are included in 
the alignment. 

p2exc (Two reference taxa, excluded, predetermined partitioning.) As 'p2inc', except 
the two reference taxa are discarded from the alignment. 

p_all (All taxa are references, predetermined partitioning.) Each marker is assigned 
randomly to the partition of one of the taxa. Only markers which are present at 
their partition's reference taxon are included in the alignment. 

For the methods which discard reference taxa (excl1, excl2, p2exc) we simulate extra taxa
at the tree generation stage to account for the taxa which will be discarded.
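The marker selection in the non-partitioned models can be sketched as a simple filter. This is an illustration only: the function name `select_markers` and the data layout (a dictionary from taxon name to the set of markers present) are assumptions. With one or two references it corresponds to the 'incl'/'excl' models, and with every taxon as a reference to 'all'; the random assignment of markers to partitions is not shown.

```python
def select_markers(genomes, references, keep_references=True):
    """Censor marker sets to those visible from the reference taxa.

    genomes: dict mapping taxon name -> set of markers present.
    references: iterable of taxon names used to build the array.
    keep_references: False discards the reference taxa from the result.
    """
    references = set(references)
    visible = set().union(*(genomes[name] for name in references))
    return {
        taxon: markers & visible
        for taxon, markers in genomes.items()
        if keep_references or taxon not in references
    }


if __name__ == "__main__":
    genomes = {"R": {1, 2, 3}, "S": {3, 4}, "A": {1, 4, 5}, "B": {2, 5, 6}}
    print(select_markers(genomes, ["R"]))                              # one reference, included
    print(select_markers(genomes, ["R", "S"], keep_references=False))  # two references, excluded
```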




In the 'incl2', 'excl2' and 'all' models, if a marker is present in any reference taxon, 
it is included in the analysis. This simulates the circumstance when all possible markers 
found in the reference taxa have been included on the DArT array. The 'p2inc', 'p2exc' 
and 'p_all' models simulate the situation where the number of possible markers is very 
much greater than the number we can put on the array, so we get an independent random 
sampling of markers from each reference. 

The models do not all produce the same quantity of data. Compared to the expected 
number of markers present at each taxon, the expected number of markers analyzed is 
equal for the 'incl1', 'p2inc' and 'p_all' models, lower for 'excl1' and 'p2exc', higher for
'incl2', several times higher for 'all', and for 'excl2' it depends on the tree, but for our 
data is higher on average. 

For each scenario (number of taxa, clock like or not, the eight data selection models) 
we generate and analyze 5000 random trees. 

Once a distance matrix has been calculated, the best tree is found by minimum evolution,
using the program FastME (Desper and Gascuel 2002). In addition, we obtain the
most Dollo parsimonious tree using the "dollop" program from Phylip (Felsenstein 2005).



To measure the accuracy of different methods we record the proportion of splits in the 
generating tree that were present in the tree inferred by FastME. 
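The accuracy measure can be sketched as a comparison of split sets. In this illustration the function names are hypothetical, each tree is assumed to be summarised already as a collection of its non-trivial splits (each given by the taxa on one side), and a split is put into a canonical form so that either side may be supplied.

```python
def normalise_split(side, taxa):
    """Canonical form of a bipartition: the side without the smallest taxon."""
    side = frozenset(side)
    all_taxa = frozenset(taxa)
    return all_taxa - side if min(all_taxa) in side else side


def split_recovery(true_splits, inferred_splits, taxa):
    """Proportion of splits in the generating tree found in the inferred tree."""
    true_set = {normalise_split(s, taxa) for s in true_splits}
    inferred_set = {normalise_split(s, taxa) for s in inferred_splits}
    return len(true_set & inferred_set) / len(true_set)


if __name__ == "__main__":
    taxa = "ABCDE"
    generating = [{"A", "B"}, {"A", "B", "C"}]
    inferred = [{"C", "D", "E"}, {"A", "C"}]  # first split equals AB | CDE
    print(split_recovery(generating, inferred, taxa))
```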



Trees in this paper were plotted using the Interactive Tree Of Life (Letunic and Bork
2011), and interactive versions may be viewed online at http://itol.embl.de/shared/mdw

Nexus files containing the raw data and tree files are available on TreeBase at
http://purl.Org/phylo/treebase/phylows/study/TB2:S12439.



A demonstration Perl program, and instructions on its use, for calculating the ADD, 
log Jaccard and log DICE distances is included in the supplementary material. 
Results

Simulated data 

Table 3 shows the proportion of splits (i.e. edges) in the reconstructed trees which were
incompatible with the true tree. (Additional tables for differing numbers of characters and 
number of taxa are provided in the supplementary material, tables 1-10.) The LogDet, 
Hamming and Jaccard distances consistently perform very poorly. ADD has the best 
overall performance, producing either the best results, or results that are not significantly 
different from the best results in six of the eight models; it also gives quite reasonable 
results in the remaining two cases. PADD has six near-best results, but fares worse on 
the remaining two. (The 'all' model violates the assumptions of PADD. Possibly 'p_all' 
performs poorly because there are so few markers in each partition.) Log Jaccard is not 
far behind the leading methods. DICE, Nei-Li and LogDice round out the middle of the 
field. Table 4 shows rather different results for the clock-like trees, with the best distances
being Jaccard, DICE, Log Jaccard, PADD then ADD. We were somewhat surprised that 
the Jaccard distance did so well here given its poor performance on the non-clock-like 
trees. Variation between distance methods is smaller, as Yule trees have some very short 
branches, which are hard for any method to resolve. LogDet performs consistently poorly 
for both clock-like and non-clock-like trees. 

Figure 2 plots the accuracy of branch length reconstruction against sequence length.
The Hamming, LogDet, Jaccard and (to a lesser extent) DICE distances all show signs of 
ceasing to improve with increasing sequence length. This is expected when the method's 
bias exceeds the sampling error. Only ADD improves at the optimal rate (lower dotted 
line.) 

In Figure 3 we investigate the possibility of bias in the distances due to tree shape. We
divide the true trees according to how many cherries are in the unrooted tree. (A 'cherry'
is an internal node directly connected to exactly two leaves.) The minimum number is 2 (a
maximally unbalanced or 'caterpillar' tree). For 15 taxa, the maximum is 7. Two-cherry
trees were too rare to get reliable statistics, so Figure 3 shows results for 3 to 7 cherries.






If a method is biased towards producing unbalanced trees, it will be more accurate when 
the true tree is unbalanced than when it is balanced. The non-logarithmic methods 
(Hamming, Jaccard, DICE) generally have high error rates on unbalanced trees, indicating 
a bias in favour of balanced trees. For the other methods, there is no obvious consistent 
bias. For example, Nei-Li, LogDice and ADD are slightly biased towards unbalanced 
trees for the non-clock like tree simulations and towards balanced trees for the clock like 
tree simulations. More plots demonstrating this lack of consistency are provided in the 
supplementary material, figures 1-6. 



Applications to real data 



Case Study 1 - Eucalyptus DArT data 



We analysed the DArT Eucalyptus dataset of Steane et al. (2011). This includes 94 
species of Eucalyptus from across the full taxonomic range (excluding Corymbia). The 
dataset comprised 7490 non-redundant DArT markers (newly acquired sequence data 
have allowed us to eliminate 864 markers as redundant, reducing the number of markers 
from the 8354 reported by Steane et al. 2011). This dataset was generated during the 
development phase of the Eucalyptus DArT array and only about 32% of the markers 
in this dataset are included on the final, publicly available Eucalyptus DArT array 
(Sansaloni et al. 2010). 



Analysis of these data using the PADD distance yielded the tree shown in Figure 4. 
Branch support for this and subsequent trees is from 1000 nonparametric (resampling 
randomly with replacement) bootstraps. Rooting on E. curtisii (subgenus Acerosae) was 
based on previous studies (Drinnan and Ladiges 1991; Steane et al. 2002). The topology 
of the PADD tree was highly concordant with the most recently published classification 
(Brooker 2000) and previous molecular studies using ITS sequence data (Steane et al. 
2002). The tree was also highly congruent with a cladistic analysis of the same data 
(Steane et al. 2011), but the PADD tree provided increased resolution at some key nodes. 
For example, sections Latoangulatae (SL), Exsertaria (SE) and Racemus (SR) in the 
PADD tree form a cluster that is distinct from all other sections, largely in agreement 
with results from ITS sequence data of Steane et al. (2002). The cladistic analysis of the 
DArT data (Steane et al. 2011) did not resolve the relationships between these sections 
and section Maidenaria. However, the position of section Racemus in relation to sections 
Latoangulatae and Maidenaria remains equivocal, with cladistic analysis of the DArT 
data and cladistic analysis of ITS sequence data placing it close to section Maidenaria. 
The bootstrap values on the branches of the PADD tree are generally high compared 
to those obtained by the cladistic analysis. Bootstrap values tended to be higher on 
internal nodes where there were longer branches (i.e., splits with more character support), 
although bootstrap values were generally >50% even when branches were very short. This 
contrasts with the cladistic analysis (Steane et al. 2011) where many branches throughout 
the cladogram had <50% bootstrap support. 
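
The nonparametric bootstrap behind these support values resamples markers with replacement and rebuilds the tree for each replicate. The sketch below illustrates the idea only; `infer_splits` is a hypothetical callable standing in for the whole distance and minimum evolution pipeline, and the function names are our own for this example.

```python
import random


def bootstrap_markers(matrix, rng):
    """Resample the columns (markers) of a taxa-by-markers 0/1 matrix with replacement."""
    n_markers = len(matrix[0])
    picks = [rng.randrange(n_markers) for _ in range(n_markers)]
    return [[row[j] for j in picks] for row in matrix]


def bootstrap_support(matrix, infer_splits, n_reps=1000, seed=0):
    """Fraction of bootstrap replicates in which each split of the reference tree recurs.

    `infer_splits` (hypothetical) maps a matrix to a set of hashable splits.
    """
    rng = random.Random(seed)
    reference = infer_splits(matrix)
    counts = {split: 0 for split in reference}
    for _ in range(n_reps):
        replicate_splits = infer_splits(bootstrap_markers(matrix, rng))
        for split in reference:
            if split in replicate_splits:
                counts[split] += 1
    return {split: counts[split] / n_reps for split in reference}
```

Passing the tree-building step in as a function keeps the resampling logic independent of the choice of distance, so the same loop could serve PADD, ADD or any of the other distances compared here.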



Case Study 2 - Adnataria (Eucalyptus) DArT data 



We screened 90 species of Eucalyptus from section Adnataria (Brooker 2000) plus three 
outgroup taxa (E. cornuta, sect. Bisectae; E. torquata, sect. Dumaria; and E. staeri, 
sect. Longistylus) using the publicly available Eucalyptus DArT array (Sansaloni et al. 
2010). Leaf samples were collected from Currency Creek Arboretum, South Australia; 
details of the samples will be given in a subsequent paper, but are available from the 
authors upon request. DNA was prepared as described previously (Steane et al. 2011). 
DArT analysis was conducted by DArT P/L (Canberra, Australia) using their standard 
protocol (Sansaloni et al. 2010). 



Section Adnataria (subgenus Symphyomyrtus) includes 100-130 terminal taxa of which 
90 were included in the DArT analysis. Because of the large amount of potential homo- 
plasy in DArT data, it is preferable to include as much genetic variation as possible 
from within the study group, in order to minimise the risk of long-branch attraction and 
misleading results. Accordingly, the samples in the study represented eight of the nine 
series delineated by Brooker (2000) and represented the full geographic distribution of the 
section. (DArT data for E. dawsonii, the single species in series Dawsonianae, were not 
available.) Of the 7680 markers on the DArT array, 3707 provided potentially 
phylogenetically informative data, of which 1230 were later found to be redundant and 
removed from the analysis, leaving 2477 markers to be analysed. The topology of the PADD 
tree shown in figure 5 is not entirely congruent with the established classification 
(Brooker 2000) but it does have some interesting features. While at first glance it appears 
that - apart from series Aquilonares - most of the series within section Adnataria are 
polyphyletic, on closer inspection the series cluster into more-or-less discrete groups. 
Series Rhodoxylon and Siderophloiae form a distinct cluster (apart from the intrusion of 
E. rummeryi, series Buxeales) even though these series are in different subsections 
(Brooker 2000). Most of series Buxeales and supraspecies Mollucanae form a discrete 
cluster that is "sister" to a cluster comprising series Heterophloiae, Melliodorae and 
Submelliodorae (and one species from series Buxeales), none of which are considered by 
Brooker (2000) to be particularly 
close. Close examination of the data reveals that there may be a biogeographic aspect to 
the clusters identified by the PADD algorithm. This is being explored in a separate paper 
(Steane et al., in prep.). 

Notable features of figure [5] are the shortness of the internal branches and the poor 
bootstrap values. The internal edges of this tree have mean bootstrap support of 29% 
and a ratio of internal to total edge lengths ("stemminess", Fiala and Sokal, 1985) of 0.052. 
Compared to the Adnataria phylogeny (figure 5), the Eucalyptus phylogeny (figure 4) has much 
higher bootstrap supports (mean 81%) and higher stemminess (0.138). The Adnataria 
dataset differs from Eucalyptus in that it has fewer markers (2477 compared to 7490) and 
its taxa are much more closely related to one another (i.e., they span a much narrower 
taxonomic range). To test whether the structural and statistical differences between the 
trees were simply a function of sample size (i.e., the number of markers), we generated 100 
samples of 2477 markers from the Eucalyptus dataset (randomly chosen with replacement) 
and then bootstrapped each sample 1000 times. The resampled Eucalyptus data had 
mean bootstrap support of 60% and mean stemminess of 0.153 (std. deviation 0.003), 
demonstrating that the differences are not simply due to sample size. We conclude that 
the short internal branch lengths in the Adnataria phylogeny may be indicative of the 
aftermath of a period of adaptive radiation that produced many species over a short 
period of time, or of a loss of phylogenetic signal as a result of hybridization (or both). 
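
The stemminess statistic quoted above is simple to compute once each edge is labelled as internal or pendant. This is a minimal sketch that follows the ratio as stated in the text (internal edge length over total edge length); the edge-list representation and the toy tree are assumptions made for illustration, and the original definition of Fiala and Sokal may differ in detail.

```python
def stemminess(edges):
    """Ratio of summed internal edge lengths to the summed length of all edges.

    `edges` is a list of (length, is_internal) pairs.
    """
    total = sum(length for length, _ in edges)
    internal = sum(length for length, is_internal in edges if is_internal)
    if total == 0:
        raise ValueError("tree has zero total length")
    return internal / total


# Hypothetical quartet tree: four pendant edges of length 1.0 and one internal edge of 0.5.
toy_edges = [(1.0, False)] * 4 + [(0.5, True)]
print(round(stemminess(toy_edges), 3))
```

Low values, such as the 0.052 found for Adnataria, mean that almost all of the tree length lies on pendant edges.
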

If a marker is monomorphic (all 0 or all 1) within a dataset, the DArT data-processing 
cannot distinguish whether it is present or absent (fig. 4 of Jaccoud et al. 2001). 
Monomorphic data are automatically excluded from the final dataset. This is a potential 
problem with the Adnataria dataset, as it is much less genetically diverse than the full 
Eucalyptus genus from which the array was developed. We are not concerned by markers 
which are absent from all samples being omitted from the data, as they are ignored by 
the ADD formula, but there is a prospect that some markers which are present in all 
taxa have been omitted. To test the effect of such an omission, we added to the dataset 
500 markers that were scored as present in all taxa, and reanalysed. The distances were 
uniformly smaller, but there was very little change to the topology, bootstrap values or 
stemminess. We conclude that this potential for missing data does not materially affect our 
results. (The trees for this test are shown in the supplementary material, figures 7 and 



Case study 3 - Gene Family Bacterial Phylogeny 



In this case study we reanalyse the gene presence/absence data used by Spencer et al. 
(2007). A potential weakness of the conditioned genome reconstruction approach which 
has attracted attention recently is the influence of the choice of conditioning genome 
on the outcome. The most sophisticated attempt to address this (Spencer et al. 2007) 
used many conditioning genomes, produced a tree for each one, and finally constructed a 
consensus tree from these. The tree they derived for 40 of the bacterial genomes is shown 
in figure 6. It has been noted previously (Lake and Rivera 2004; Spencer et al. 2007) that 
parallel loss of genes required for free living causes parasitic bacteria from all bacterial 
phyla to falsely group together in this analysis. 

As we have derived ADD as a special case of LogDet that is applicable to a Dollo 
process, we can analyze these COG data without the need for a conditioning genome. In 
effect, where Lake and Rivera (2004) use LogDet distances and a conditioning genome to 
find $n_{00}$, we use LogDet distances assuming $n_{00}$ is infinite (see the equation 
deriving ADD from LogDet). The resulting phylogeny is shown in figure 7. It differs from 
the Spencer et al. (2007) phylogeny on only three edges, two of which have less than 50% 
bootstrap support in their tree. 



Discussion 



ADD is a simple, consistent distance suitable for use with binary data which evolve under 
a stochastic Dollo process. We have shown that ADD is also consistent for data which 
have been censored in such a way that only traits that are present in a single reference 
taxon are observable. Censoring by multiple reference taxa is more complex but in the 
case where each trait is known to derive from a particular reference taxon (e.g. in the 
DArT data presented above), ADD can be extended by a straightforward partitioning 
scheme (PADD). 

Our simulation supports the expectation from theory that ADD and PADD should 
perform well on data generated under a stochastic Dollo model with various censoring 
schemes. By contrast, several other distances that are in common use for both DArT 
data and for analysis of gene-content data do not perform well in the simulations. This 
suggests that they should probably not be used for data which are thought to be generated 
under a Dollo model. We used DArT data and gene presence/absence data as illustrations 
of our new approach, but our distances can be applied to any data derived from complex 
traits that are unlikely to evolve more than once independently. 

More specifically, our simulations indicate that LogDet performs very poorly on 
stochastic Dollo data. This is a concern for the use of LogDet distances in conditioned 
genome reconstruction (Lake and Rivera 2004; Spencer et al. 2007). The use of LogDet 
on gene presence/absence data in deep bacterial phylogeny not only necessitates an extra 
layer of complication with conditioning genomes to find the number of shared absences, 
but also depends critically on horizontal gene transfer (HGT) being sufficiently 
common to justify the use of a Markov model. For the gene-content phylogeny based 
on the COG data, it is interesting that the tree found using ADD is very similar to 
the tree found by Spencer et al. (2007). Indeed the two approaches have very different 
underlying assumptions. ADD ignores the possibility of HGT, which certainly does occur 
for these data, and the augmented conditioning genome reconstruction method of 
Spencer et al. (2007) ignores the Dollo aspect of the data, treating shared absences and 
shared presences equivalently. Both methods recover the (presumed artefactual) parasite 
clade (Sangaralingam et al. 2010). We could imagine an intermediate model for this type 
of data, where markers primarily evolve via a stochastic Dollo process, but are occasionally 
subject to horizontal gene transfer. In this situation, a suitable distance might be the 
LogDet distance, with $n_{00}$ augmented, but still finite. 

ADD also appears to be an appropriate distance to use for DArT data, and we have 
applied it to two closely related datasets. The results obtained using the PADD algorithm 
for the Eucalyptus data are satisfyingly congruent with traditional morphology-based 
taxonomies (e.g. Brooker 2000) and phylogenies based on ITS sequence data. In addition, 
the PADD analysis of DArT data appears to somewhat improve on the parsimony-based 
analysis as it provides more resolution, possibly because the parsimony-based approach 
does not take account of branch length information and can thus be more easily misled 
by homoplasy. 



In pilot studies we demonstrated that DArT data had the potential to resolve 
relationships among closely related Eucalyptus species (Steane et al. 2011). This level of 
resolution has always been elusive to eucalypt systematists for various reasons including 
recent/incomplete speciation and the high incidence of inter-specific hybridisation (Byrne 
2008). To test the efficacy of the partitioned additive Dollo distance at this level of 
divergence we applied it to a set of DArT data for section Adnataria. The PADD-derived 
Adnataria phylogeny suggests that the series delimited by Brooker (2000) are not robust 
groups. However, the lack of bootstrap support and the short branch lengths do 
not provide us with confidence that PADD analysis of the DArT data has uncovered the 
"true tree". Further analyses of this dataset (currently underway) may reveal patterns of 
variation in other traits (e.g., morphology, physiology, biogeography) that will inform us 
about the plausibility of this DArT-based phylogeny derived using the PADD algorithm. 
With the increasing debate about the appropriateness of the Tree of Life metaphor for 
many domains of life, it would be timely to attempt to extend the basic Dollo model to 
incorporate borrowing of traits. Indeed, both the bacterial gene-content example and the 
Adnataria example are cases where Dollo models that include "borrowing" of traits would 
be very appropriate. Another avenue to pursue would be the development of consistent 
distances in the case of multi-state Dollo models (such as discussed by Alekseyenko et al. 
2008). With the ever-increasing abundance of genomic data, finding good models for the 
evolution of complex traits is as appropriate now as it was in the time of Dollo. 

Tables 



Distance          DICE   LogJaccard        LogDICE                   LogDet          ADD
Alignment:        A:10   A:00111           A:001                     A:00101         A:011
                  B:01   B:11001           B:111                     B:00011         B:101
                  C:11   C:11111           C:011                     C:00001         C:111
d(A,B)            1      log(5)=1.61       log(2)=0.69               log(6)=1.79     log(4)=1.39
d(A,C) + d(B,C)   2/3    2 log(5/3)=1.02   log(1.5)+log(1.25)=0.63   log(8/3)=0.98   2 log(1.5)=0.81

Table 1: Counterexamples to the Triangle Inequality 
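
The counterexamples in Table 1 can be checked numerically. The sketch below is illustrative rather than definitive: DICE is the usual Dice dissimilarity, LogJaccard and LogDICE are taken as the negative logarithms of the corresponding similarities, and the ADD formula is an assumed form, -log(n11^2 / (n_a n_b)), chosen because it reproduces the ADD column of the table. LogDet is omitted because its formula is not restated in this section. All function names are hypothetical.

```python
import math


def counts(a, b):
    """Return (n11, n10, n01) for two equal-length 0/1 strings."""
    n11 = sum(x == "1" and y == "1" for x, y in zip(a, b))
    n10 = sum(x == "1" and y == "0" for x, y in zip(a, b))
    n01 = sum(x == "0" and y == "1" for x, y in zip(a, b))
    return n11, n10, n01


def dice(a, b):
    n11, n10, n01 = counts(a, b)
    return 1 - 2 * n11 / (2 * n11 + n10 + n01)


def log_jaccard(a, b):
    n11, n10, n01 = counts(a, b)
    return -math.log(n11 / (n11 + n10 + n01))


def log_dice(a, b):
    n11, n10, n01 = counts(a, b)
    return -math.log(2 * n11 / (2 * n11 + n10 + n01))


def add(a, b):
    # Assumed form, inferred from Table 1: -log(n11^2 / (n_a * n_b)),
    # where n_a and n_b count the markers present in each taxon.
    # Shared absences (n00) do not enter the formula.
    n11, n10, n01 = counts(a, b)
    return -math.log(n11 ** 2 / ((n11 + n10) * (n11 + n01)))


# Alignments A, B, C copied from Table 1.
examples = {
    "DICE": (dice, "10", "01", "11"),
    "LogJaccard": (log_jaccard, "00111", "11001", "11111"),
    "LogDICE": (log_dice, "001", "111", "011"),
    "ADD": (add, "011", "101", "111"),
}

for name, (dist, a, b, c) in examples.items():
    direct = dist(a, b)
    via_c = dist(a, c) + dist(b, c)
    print(f"{name}: d(A,B)={direct:.2f}, d(A,C)+d(B,C)={via_c:.2f}, "
          f"violated={direct > via_c}")
```

In each case d(A,B) exceeds d(A,C) + d(B,C), so the triangle inequality fails for that distance.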



Distance         Hamming   Jaccard   DICE   LogJaccard   LogDICE   Nei-Li    LogDet    ADD
Triangle ineq.   Yes (a)   Yes (b)   No     No           No        Unknown   No        No
Additive         No        No        No     Unknown      Unknown   Unknown   Yes (c)   Yes

a: Hamming 1950; b: Lipkus 1999; c: Lake 1994 

Table 2: Summary of the mathematical properties of the distances tested in this paper. 



Method      incl1         excl1         incl2         excl2         all           p2inc         p2exc         p_all
LogDet      n/a           0.0624(15)    0.3472(227)   0.1313(64)    0.0824(273)   0.2616(121)   0.1242(42)    0.0950(63)
Hamming     0.1204(51)    0.1047(35)    0.1008(59)    0.0857(37)    0.0370(121)   0.0966(36)    0.0931(28)    0.0457(26)
Jaccard     0.0649(22)    0.0699(19)    0.0508(25)    0.0578(21)    0.0665(220)   0.0600(17)    0.0679(16)    0.0512(30)
DICE        0.0435(10)    0.0464(7)     0.0246(7)     0.0288(5)     0.0381(125)   0.0347(4)     0.0402(3)     0.0231(9)
Nei-Li      0.0549(16)    0.0541(11)    0.0438(20)    0.0405(11)    0.0011(2)     0.0398(7)     0.0446(6)     0.0168(4)
LogDICE     0.0497(14)    0.0495(9)     0.0363(15)    0.0337(8)     0.0006        0.0362(5)     0.0398(3)     0.0140(2)
LogJaccard  0.0406(9)     0.0416(5)     0.0230(6)     0.0222(1)     0.0017(4)     0.0293(2)     0.0329(0)     0.0117
ADD         0.0241        0.0308        0.0303(11)    0.0269(4)     0.0006        0.0273(1)     0.0326        0.0139(2)
PADD        0.0241        0.0308        0.0146        0.0206        0.0082(25)    0.0260        0.0354(1)     0.0341(17)
DolloP      0.0396        0.0363        0.0236        0.0228        0.0004        0.0245        0.0296        0.0074



Table 3: The proportion of incorrect splits in minimum evolution trees derived from 
the various distances. Columns show the results for each of the various models of data 
censoring. Numbers in parentheses indicate how many standard deviations worse this 
result is than the best result for this model. Within each column, for the distance- 
based methods, the best (lowest) value is shown in bold, along with any which are not 
significantly worse than the best value. Distance-based results are colour coded, from 
green (best in column) to red (at least 20 standard deviations worse than the best). Also 
included for comparison are the Dollo parsimony results. These data are derived from 
simulations using 15 taxa and a mean of 500 markers per taxon, with 5000 random trials. 

Method      incl1         excl1         incl2         excl2         all           p2inc         p2exc         p_all
LogDet      n/a           0.0677(19)    0.2309(139)   0.1663(88)    0.0223(24)    0.1769(80)    0.1326(50)    0.0811(31)
Hamming     0.0432(8)     0.0401(6)     0.0355(12)    0.0327(8)     0.0044(1)     0.0365(6)     0.0385(4)     0.0253(1)
Jaccard     0.0272        0.0287        0.0162        0.0189        0.0039        0.0248        0.0295        0.0235
DICE        0.0325(3)     0.0336(2)     0.0209(3)     0.0240(3)     0.0042(0)     0.0285(2)     0.0337(2)     0.0259(1)
Nei-Li      0.0408(7)     0.0420(7)     0.0399(15)    0.0369(11)    0.0069(4)     0.0365(6)     0.0408(5)     0.0314(4)
LogDICE     0.0393(6)     0.0407(6)     0.0364(13)    0.0347(9)     0.0062(3)     0.0353(6)     0.0395(5)     0.0308(4)
LogJaccard  0.0340(3)     0.0357(3)     0.0254(6)     0.0275(5)     0.0049(1)     0.0302(3)     0.0352(3)     0.0269(2)
ADD         0.0336(3)     0.0364(4)     0.0366(13)    0.0346(9)     0.0062(3)     0.0336(5)     0.0390(5)     0.0310(4)
PADD        0.0336(3)     0.0364(4)     0.0237(5)     0.0284(6)     0.0108(9)     0.0238(4)     0.0396(5)     0.0364(7)
DolloP      0.0411        0.0404        0.0288        0.0311        0.0061        0.0332        0.0374        0.0270



Table 4: As table 3 but for Yule trees. Unexpectedly, the Jaccard distance performs very 
well in these simulations. Only LogDet performs poorly. 

Figure 1: Multiple reference taxa. The binary strings indicate the presence or absence of 
DArT markers. There is a 0.5 probability of marker loss along any edge. Only markers 
present at reference taxon R or S are shown on the diagram. Each character (the 1s and 
0s for a given marker) occurs a number of times proportional to its probability. 

Figure 2: Root mean square (RMS) error in branch length estimation plotted as a function 
of the number of characters for the various distance methods. If the method is biased, the 
RMS error cannot fall below the error caused by the bias, and the line is approximately 
horizontal (e.g. Hamming distance). The grey "1/sqrt(n)" line illustrates the expected 
slope of an optimal method, with variance inversely proportional to sequence length. 

Figure 3: The effect of tree shape (measured by the number of cherries in the unrooted 
true tree) on split error rate. In each panel, unbalanced trees are on the left, balanced 
trees on the right. (Points are for 3 to 7 cherries.) Downward sloping lines indicate a 
bias towards constructing balanced trees, and upward sloping lines a bias for unbalanced 
trees. The vertical axis is linear and originates at zero, but is of different scale in each 
panel. 

Figure 4: Eucalyptus phylogeny derived from DArT data analyzed with PADD distance 
and minimum evolution tree building. 







Figure 5: Phylogeny for Eucalyptus section Adnataria, derived from DArT data using 
PADD distance and minimum evolution tree building. (Species names anonymized for 
arXiv submission only.) 

Figure 6: Gene family presence/absence phylogeny with bootstrap support from 
Spencer et al. (2007, figure 9). (Edges are not drawn to scale.) 

Figure 7: Phylogeny of bacteria by COG gene family presence/absence data, using the 
additive Dollo distance. Edges are to scale. 